File Format

A file format is simply a way of defining how information is stored in the HDFS file system. The choice is usually driven by the use case or by the processing algorithms of a specific domain. A file format should be well defined and expressive: it should be able to handle a variety of data structures, specifically structs, records, maps and arrays, along with strings, numbers and so on. It should also be simple, binary and compressed. When dealing with Hadoop's file system, not only do you have all of the traditional storage formats available (you can store PNG and JPG images on HDFS if you like), but you also have a number of Hadoop-focused file formats for structured and unstructured data. A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints. The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases.

Choosing an appropriate file format can have some significant benefits:

  • Faster read times 
  • Faster write times 
  • Splittable files (so you don’t need to read the whole file, just a part of it) 
  • Schema evolution support (allowing you to change the fields in a dataset) 
  • Advanced compression support (columnar files can be compressed with a compression codec without sacrificing these features) 

Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.

File formats in Hadoop can be roughly divided into two categories:
  •  Row-oriented
  •  Column-oriented
Row-oriented:
The same row of data is stored together, in contiguous storage: SequenceFile, MapFile, Avro data file. With this layout, even if only a small amount of data in a row needs to be accessed, the entire row must be read into memory. Lazy deserialization can lessen the problem to some extent, but the overhead of reading the whole row from disk cannot be avoided. Row-oriented storage is suitable for situations where the entire row of data needs to be processed at once.
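As a minimal illustration, here is a PySpark sketch of writing and reading a row-oriented SequenceFile. It assumes a local Spark installation; the path and the sample records are made up.

```python
# Minimal PySpark sketch: writing and reading a row-oriented SequenceFile.
# Assumes a local Spark installation; paths and sample records are made up.
from pyspark import SparkContext

sc = SparkContext("local[*]", "sequencefile-demo")

# A SequenceFile stores binary key/value pairs, keeping whole rows together.
rows = sc.parallelize([(1, "alice,42"), (2, "bob,37"), (3, "carol,29")])
rows.saveAsSequenceFile("/tmp/users_seq")

# Reading it back pulls entire rows into memory, even if we only needed
# one field of each row -- the row-oriented trade-off described above.
print(sc.sequenceFile("/tmp/users_seq").collect())
```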

Column-oriented:
The file is split into several columns of data, and each column is stored together: Parquet, RCFile, ORCFile. The column-oriented format makes it possible to skip unneeded columns when reading data, which suits situations where only a small subset of the columns is needed. However, reading and writing in this format requires more memory, because a whole column chunk has to be buffered (to collect a column across multiple rows). It is also not well suited to streaming writes: once a write fails, the current file cannot be recovered, whereas row-oriented data can be resynchronized to the last sync point after a failed write, which is why Flume uses a row-oriented storage format.
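Here is a small sketch of the column-pruning benefit, using Parquet through a local SparkSession; the path and the column names are made up for illustration.

```python
# Sketch of column pruning with a column-oriented format (Parquet here),
# assuming a local SparkSession; the path and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 42), (2, "bob", 37)], ["id", "name", "age"]
)
df.write.mode("overwrite").parquet("/tmp/users_parquet")

# Only the 'age' column is scanned; the other columns are skipped on disk,
# which is the main benefit of the column-oriented layout.
spark.read.parquet("/tmp/users_parquet").select("age").show()
```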

[Picture 1: a logical table, shown in a row-oriented layout (SequenceFile) and a column-oriented layout (RCFile).]

If it is still not clear what row-oriented and column-oriented mean, don't worry; you can visit the link below to learn more about the difference between them.
Here are a few related file formats that are widely used in the Hadoop ecosystem:

https://towardsdatascience.com/new-in-hadoop-you-should-know-the-various-file-format-in-hadoop-4fcdfa25d42b




RC (Record Columnar) File Input Format

RCFILE stands for Record Columnar File, another type of binary file format which offers a high compression rate and is used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs, and in that respect they share much with SEQUENCEFILE. RCFILE stores the columns of a table in a columnar manner: it first partitions rows horizontally into row splits, and then partitions each row split vertically, column by column. RCFILE stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part. This means that RCFILE favours column-oriented storage rather than row-oriented storage, which is very useful for analytics; it is easy to run analytical queries in Hive over a column-oriented storage type. We cannot load data into an RCFILE directly: first we need to load the data into another table, and then overwrite it into our newly created RCFILE table (a small sketch of this workflow follows the feature list below).

  • Columns stored separately
  • Only the needed columns are read and decompressed
  • Better compression
  • Columns stored as binary blobs
  • Depends on the metastore to supply data types
  • Large blocks - 4 MB by default
  • Still has to search the file for split boundaries
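Below is a hedged sketch of the "load into a staging table, then INSERT OVERWRITE" workflow described above, issued through a Hive-enabled SparkSession (the same statements can be run from the Hive CLI or beeline). The table names, column names and CSV path are made up.

```python
# Hedged sketch of loading an RCFile table via a staging table.
# Assumes a Hive-enabled Spark build; names and paths are illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rcfile-demo")
         .enableHiveSupport()
         .getOrCreate())

# Plain text staging table that data can be loaded into directly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS users_text (id INT, name STRING, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")
spark.sql("LOAD DATA LOCAL INPATH '/tmp/users.csv' INTO TABLE users_text")

# The RCFile table cannot be loaded directly, so it is overwritten
# from the staging table instead.
spark.sql("""
    CREATE TABLE IF NOT EXISTS users_rc (id INT, name STRING, age INT)
    STORED AS RCFILE
""")
spark.sql("INSERT OVERWRITE TABLE users_rc SELECT * FROM users_text")
```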
ORC (Optimized Row Columnar) Input Format

ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75%. As a result, the speed of data processing also increases, and ORC shows better performance than the Text, Sequence and RC file formats. An ORC file contains row data in groups called stripes, along with a file footer, and the format improves performance when Hive is processing the data. As with RCFILE, we cannot load data into an ORCFILE directly: first we need to load the data into another table, and then overwrite it into our newly created ORCFILE table.

  • Columns stored separately
  • Knows the types - uses type-specific encoders
  • Stores statistics (min, max, sum, count)
  • Has a lightweight index
  • Can skip over blocks of rows that don't matter
  • Larger blocks - 256 MB by default, with an index for block boundaries
Using ORC files improves performance when Hive is reading, writing and processing data, compared to Text, Sequence and RC. RC and ORC show better performance than the Text and Sequence file formats, and between the two, ORC is always the better choice: it takes less time to access the data and less space to store it. However, ORC increases CPU overhead, because of the time it takes to decompress the relational data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions.
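The following PySpark sketch illustrates ORC's read-side optimizations; the path and data are made up. Because ORC stores per-stripe statistics and a lightweight index, a filter like the one below allows the reader to skip row groups whose min/max values rule them out.

```python
# Minimal PySpark sketch of column pruning plus predicate pushdown on ORC.
# Paths and sample data are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 42), (2, "bob", 37), (3, "carol", 29)],
    ["id", "name", "age"],
)
df.write.mode("overwrite").orc("/tmp/users_orc")

# Only the needed columns are read, and stripes whose statistics
# exclude age > 40 can be skipped entirely.
spark.read.orc("/tmp/users_orc").where("age > 40").select("name").show()
```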

AVRO Format

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop's Writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages, and it is a preferred tool for serializing data in Hadoop. Avro is an opinionated format which recognizes that data stored in HDFS is usually not a simple key/value combination like int/string. The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively.

Honestly, Avro is not just a file format; it is a file format plus a serialization and deserialization framework. With regular old sequence files you can store complex objects, but you have to manage the process yourself. Avro handles this complexity while providing other tools to help manage data over time. It is a well-thought-out format which defines file data schemas in JSON (for interoperability), allows for schema evolution (remove a column, add a column), and supports multiple serialization/deserialization use cases. It also supports block-level compression. For most Hadoop-based use cases, Avro is a really good choice.

Avro depends heavily on its schema. Data can be read later without prior knowledge of the schema, because the schema is stored along with the Avro data in the file for any further processing. Serialization is fast and the resulting serialized data is small. In RPC, the client and the server exchange schemas during the connection; this exchange helps in reconciling same-named fields, missing fields, extra fields and so on. Avro schemas are defined in JSON, which simplifies implementation in languages with JSON libraries. Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers and Thrift.
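As a hedged sketch, here is how writing and reading an Avro data file looks with the official avro Python package (in some versions the parse function is capitalized as Parse). The schema, file name and records are made up; the point is that the schema is plain JSON and is embedded in the file alongside the data.

```python
# Hedged sketch using the 'avro' Python package; schema and data are made up.
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# The schema is plain JSON, which is what makes Avro easy to implement
# in any language with a JSON library. (Function may be 'Parse' in some versions.)
schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
    ],
}))

# The schema is written into the file, so readers need no prior knowledge of it.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7})
writer.close()

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```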

Thrift & Protocol Buffers Vs. Avro

Thrift and Protocol Buffers are the libraries most often compared with Avro. Avro differs from these frameworks in the following ways:

  • Avro supports both dynamic and static typing, as required. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types, and these IDLs are used to generate code for serialization and deserialization.
  • Avro is built within the Hadoop ecosystem, whereas Thrift and Protocol Buffers are not. Unlike Thrift and Protocol Buffers, Avro defines its schemas in JSON rather than in a proprietary IDL.
Parquet Format

The latest hotness in file formats for Hadoop is columnar file storage. Basically this means that instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other. So datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of the data stored on disk, as it can access all values of a single column very quickly without reading whole records.

  • Design based on Google's Dremel paper
  • Schema segregated into the footer
  • Column-major format with row groups
  • Simple type model with logical types
  • All data pushed to the leaves of the tree
  • Integrated compression and indexes
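The sketch below illustrates the column projection and integrated compression just listed, using the pyarrow library (an assumption; the source discusses the format, not a specific library). The file name, columns and data are made up.

```python
# A small pyarrow sketch of block-level compression and column projection
# with Parquet; file name, columns and data are illustrative only.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "age": [42, 37, 29],
})

# Each column chunk is compressed independently (Snappy here), so the file
# stays splittable even though Snappy itself is not a splittable codec.
pq.write_table(table, "users.parquet", compression="snappy")

# Read back only the columns we care about.
print(pq.read_table("users.parquet", columns=["name", "age"]))
```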
The Parquet file format is also a columnar format, so datasets are partitioned both horizontally and vertically. Just like the ORC file, it compresses well and gives great query performance, and it is especially efficient when querying data from specific columns. Parquet is computationally intensive on the write side, but it reduces a lot of I/O cost to deliver great read performance. It enjoys more freedom than the ORC file in schema evolution, in that it can add new columns at the end of the structure.

If you are chopping and cutting up datasets regularly, then these formats can be very beneficial to the speed of your application; but frankly, if you have an application that usually needs entire rows of data, the columnar formats may actually be a detriment to performance due to the increased network activity required. One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar).

Hadoop supports both file-level compression and block-level compression. File-level compression means you compress entire files regardless of the file format, the same way you would compress a file in Linux; some of these compressed files remain splittable (e.g. bzip2, or LZO if indexed). Block-level compression is internal to the file format, so individual blocks of data within the file are compressed. This means that the file remains splittable even if you use a non-splittable compression codec like Snappy. However, this is only an option if the specific file format supports it.

Summary

Overall, these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read just segments of records rather than the whole thing (which is more common in MapReduce). Since Avro and Parquet have so much in common, when choosing a file format to use with HDFS we need to consider read performance and write performance. Because the nature of HDFS is to store data that is written once and read multiple times, we want to emphasize read performance. The fundamental difference in terms of how to use either format is this: Avro is a row-based format; if you want to retrieve the data as a whole, you can use Avro. Parquet is a column-based format; if your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet.

Hopefully by now you've learned a little about what file formats actually are and why you would think of choosing a specific one. We've discussed the main characteristics of common file formats and talked a little about compression. This is not an exhaustive article that can answer all use cases, but it definitely provides pointers for you to explore, so if you want to learn more about the particular codecs I would ask you to visit their respective Apache / Wikipedia pages.

Let us see how we can sort the data globally using ORDER BY.

When we sort the data using ORDER BY, only one reducer is used for the sort.
We can sort the data either in ascending or descending order.
By default the data is sorted in ascending order.
We can sort the data using multiple columns.
If we want to sort the data in descending order on a particular column, we need to specify DESC on that column, as in the sketch below.
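Here is a hedged sketch of a global ORDER BY, issued through Spark SQL so it runs stand-alone (in Hive the same query would run against a real table). The table and column names are made up.

```python
# Hedged sketch of a global sort with ORDER BY; table/columns are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orderby-demo").getOrCreate()

# A tiny made-up table registered as a temporary view so the query runs
# stand-alone; in Hive the same ORDER BY query would hit a real table.
spark.createDataFrame(
    [(1, "sales", 3000), (2, "sales", 4500), (3, "hr", 4000)],
    ["id", "department", "salary"],
).createOrReplaceTempView("employees")

# Ascending is the default; DESC is specified per column, so this sorts by
# department ascending and then by salary descending. In Hive, a global
# ORDER BY like this funnels all the data through a single reducer.
spark.sql("""
    SELECT id, department, salary
    FROM employees
    ORDER BY department, salary DESC
""").show()
```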

How to handle small files in Hadoop.

Hadoop is designed for large files, so how do you handle small files?
There are two primary reasons why small files are problematic in Hadoop:

  • NameNode memory management and
  • MapReduce performance.
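One common mitigation, sketched below under the assumption that Spark is available, is to compact many small files into a few larger ones; the paths and target file count are made up, and HAR files or SequenceFiles of (filename, content) pairs are other common approaches.

```python
# Hedged sketch: compacting a directory of small files with Spark.
# Paths and the target file count are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read a directory full of small text files...
df = spark.read.text("/data/incoming/small-files/")

# ...and rewrite it as a handful of larger files, which keeps the number
# of NameNode objects and MapReduce/Spark input splits manageable.
df.coalesce(8).write.mode("overwrite").text("/data/compacted/")
```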
